Abstract. In this chapter we provide an overview of some of the main issues in
machine learning. We discuss machine learning from both a formal and a statistical
perspective. We describe some aspects of machine learning, such as concept learning,
support vector machines, and graphical models, in more detail. We also present
example applications of machine learning to the Semantic Web.
Introduction


The goal of Machine Learning (ML) is to construct computer programs that can learn
from data. The inductive inference of machine learning, i.e., generalization from a set
of observed instances, can be contrasted with early Artificial Intelligence (AI) approaches
that dealt mostly with deductive inference (cf. Krötzsch et al. [24] in this volume), i.e., the
derivation of theorems from axioms. Although ML is considered a subfield of AI, it also
intersects with many other scientific disciplines such as statistics, cognitive science, and
information theory.¹ An area closely related to ML is data mining [11,12], which deals
with the discovery of new and interesting patterns in large data sets. Although the terms ML
and data mining are often used interchangeably, one might state that ML is more focused on
adaptive behavior and operational use, whereas data mining focuses on handling large
amounts of data and discovering previously unknown patterns (implicit knowledge,
regularities) in the data. Most of this chapter discusses ML in the context of a formal AI
system, although where suitable, as in the discussion of graphical models, we assume a
more statistical perspective.

ML approaches can be distinguished in terms of representation and adaptation. A
machine learning system needs to store the learned information in some knowledge
representation structure, which is called an (inductive) hypothesis and typically takes the
form of a model. Following Ockham's razor, the hypothesis should generalize the
training data, giving preference to the simplest hypothesis; to obtain a valid
generalization, the hypothesis should be simpler than the data itself. A learning algorithm
specifies how to update the learned hypothesis with new experience (i.e., training data) such
that the performance measure with regard to the task is optimized (see Figure 1).





¹ Certainly, there are many different perspectives on machine learning: some researchers see the
strongest link with statistics and even claim that the two fields are identical.
Figure 1. A generic machine learning method.


Over the years, machine learning methods have been applied to many real-world
problems such as spoken language recognition, fraud detection, customer relationship
management, and gene function prediction. To provide a concrete example where
machine learning has been effective in a Web service, consider the task of categorizing
email messages as spam or non-spam, where the performance of the machine learning
method is assessed by the percentage of email messages correctly classified. The training
experience in this problem may come in the form of a database of emails that have been
labeled as spam or non-spam by humans.

The rest of this chapter is organized as follows. In Section 1 we discuss some
basic aspects of machine learning, and in Section 2 we discuss concept learning,
support vector machines, and graphical models. In Section 3 we discuss applications of
machine learning to the Semantic Web, and Section 4 contains our conclusions.


1. Machine Learning Basics 


In this section, we introduce the basic components of a machine learning problem. We
note that learning algorithms are typically implemented as some kind of search and
local optimization. Section 2 then describes three important machine learning
approaches in more detail.


1.1. Tasks 


1.1.1. Classification and Regression


The tasks of classification and regression deal with the prediction of the value of one
field (the target) based on the values of the other fields (attributes or features). If the
target is discrete (e.g., nominal or ordinal), the task is called classification. If
the target is continuous, the task is called regression. Classification and regression are
normally supervised procedures: based on a set of correctly labeled training
instances, the model learns to correctly label new, unseen instances.

An example classification problem may consist in predicting whether or not to grant
credit to a customer. The values of the class c in this problem could be drawn from
the set {yes, no}, representing a positive and a negative decision, respectively. The input
to the classification method (that is, to a classifier) would consist of information about
a customer. In particular, if the hypothesis space consists of rules, the output may be
a set of learned rules such as the one presented in Figure 2a.


1.1.2. Learning Associations


An association describes a relation between objects, or measured quantities, that is the 
result of some interaction or of a dependency between the objects. Typically, the learned 
associations are in the form of association rules or sets of frequent items. The motivation 
for this type of task has been provided by market basket analysis where the methods for 
finding associations between products bought by customers are studied. For example, 
consider that customers who buy X (e.g. beer) typically also buy Y (e.g. chips); then, if 
we encounter a customer who buys X but does not buy Y, we may target this customer 
via cross-selling as a potential customer for Y. An itemset is called frequent if it appears 
in at least a given percentage (called support) of all transactions. Frequent itemsets are 
often the prerequisite for the learning of association rules. 
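
The notion of support can be made concrete with a small sketch. The transactions and the
threshold below are hypothetical and chosen only for illustration, and the brute-force
enumeration stands in for the more efficient algorithms used in practice.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of all itemsets with up to `max_size` items."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

# Hypothetical market-basket data (not taken from the chapter).
transactions = [
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "bread"}),
    frozenset({"beer"}),
    frozenset({"bread", "milk"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 50%, the pair {beer, chips} is frequent in this toy data set,
whereas {milk} is not.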


1.1.3. Clustering 


Clustering is an unsupervised task, whose aim is to group a set of objects into classes of 
similar objects. A cluster is a collection of objects that are similar to each other within 
the same cluster, and dissimilar to the objects in other clusters. Therefore, an important 
notion in clustering (also known as cluster analysis in statistics) is the notion of similar- 
ity (or distance). In conceptual clustering, a symbolic representation of each cluster is 
extracted and we may consider each cluster to be a concept, closely related to a class in 
classification. 


1.1.4. Other Machine Learning Tasks


Some examples of other machine learning tasks are: reinforcement learning, learning to 
rank and structured prediction. 

The reinforcement learning task consists of learning sequential control strategies. It
deals with situations where the output of the system is a sequence of actions that are
performed to achieve some goal. An example is game playing, where the complete
sequence of moves is important, rather than a single move.

Learning to rank is a type of (semi-)supervised learning problem where the goal
is the automatic construction of a ranking model from training data, e.g., to learn to rank
the importance of returned Web pages in a search application.

Structured prediction deals with prediction problems in which the output is a com- 
plex structure. Such problems arise in disciplines such as computational linguistics, e.g. 
in natural language parsing, speech, vision, and biology. 


1.2. Training Data 


One distinguishes three important classes of feedback: feedback in the form of labeled 
examples, feedback in the form of unlabeled examples, or feedback in the form of reward 
and punishment, as in reinforcement learning. 

Supervised learning consists of learning a function from training examples, based
on their attributes (inputs) and labels (outputs). Each training example is a pair (x, f(x)),
where x is the input and f(x) is the output of the underlying unknown function. The
aim of supervised learning is: given a set of examples of f, return a function h that best
approximates f. For example, given symptoms and corresponding diagnoses for patients,
the goal is to learn a prediction model that makes a diagnosis for a new patient based on
that patient's symptoms.

Unsupervised learning is concerned with learning patterns in the input, without any 
output values available for training. Continuing our example, only symptoms of patients 
may be available and the goal may be to discover groups of similar patients. 

In semi-supervised learning, both labeled and unlabeled data are used for training,
typically with only a small amount of labeled data but a large amount of unlabeled data.
In the clinical example, diagnoses might be available for only a few patients, and the goal
would be to use this information to infer the most probable diagnoses for all patients.

In reinforcement learning, input/output pairs are not available to the learning system. 
Instead, the learning system receives some sort of a reward after each action, and the 
goal is to maximize the cumulative reward for the whole process. For example, in case 
of treatment planning, the learning system may receive reinforcement from the patient 
(e.g., feels better, feels worse, cured) as an effect of actions taken during a treatment. 

Typically, a training data set comes in the simple form of an attribute-value data
table. However, more complex input data has also been studied, for example sequences,
time series, or graphs. From the point of view of this book, an interesting setting is
presented by Inductive Logic Programming (ILP) [33,36], originally defined as a subfield
of machine learning that assumes Logic Programming as the representation formalism for
hypotheses and background knowledge. Since then, ILP has been broadened to a
definition that considers not only Logic Programs as a representation, but also
other subsets of first-order logic. In particular, its methodology is well suited to
tasks where Description Logics are used as the representation formalism. The distinguishing
feature of ILP methods is their ability to take into account background knowledge,
that is, knowledge that is generally valid in some domain, represented, for example, in the
form of ontologies.


1.3. Models 


Machine learning hypotheses may come in a variety of knowledge representation forms, 
such as equations, decision trees, rules, distances and partitions, probabilistic and graph- 
ical models. Classically, a division is made between symbolic and sub-symbolic forms of 
knowledge representation. The first category consists of representation systems in which 
the atomic building blocks are formal symbolic representations, often easily readable 
by a human. Such representation systems have compositional syntax and semantics, and 
their components may be assigned an interpretation. The system may, for example, be 
composed of a set of rules such as the one presented in Figure 2a. A good example of a 
symbolic system is an interpreted logical theory. 

In turn, the components of a sub-symbolic representation system do not have a clear
interpretation and are not formal representations by themselves. Knowledge in this
approach is represented as numerical patterns that determine the computation of an output
when the system is presented with a given input. Good examples of sub-symbolic systems are
neural networks, where the patterns are represented in the form of interconnected groups of
simple artificial neurons.

Figure 2 provides an illustration of the distinction between symbolic and sub- 
symbolic representations. 

Typical machine learning algorithms induce models, that is, hypotheses that
globally characterize an entire data set.

(a) IF income > 1500 THEN c = yes

(b) [diagram not reproduced]

Figure 2. An illustration of a) symbolic and b) sub-symbolic representation.


1.4. Generative and Discriminative Models


So far the discussion has been oriented towards a formal description of a learning problem.
Here we describe a statistical view based on the concepts of generative and discriminative
probabilistic models.

Generative models simulate a data generating process. In an unsupervised learning
problem, this would involve a model for P(X), i.e., a probabilistic model for generating
the data.² In supervised learning, such as classification, one might assume that a class
Y ∈ {0, 1} is generated with some probability P(Y) and the class-specific data is
generated via the conditional probability P(X|Y). Models are learned for both P(Y) and
P(X|Y), and Bayes' rule is employed to derive P(Y|X), i.e., the class label probability
for a new input X. In contrast, discriminative models model the conditional probability
distribution P(Y|X) directly: they learn a direct mapping from inputs X to class label
probabilities. To illustrate the difference between generative and discriminative models,
consider the task of determining the language of a given speaker.
In a generative modeling approach this task would be solved by learning a language
model for each language under consideration, i.e., by learning P(X|Y) for each language
Y, and then applying Bayes' rule to infer the language for a new text x. A
discriminative model would not bother modeling the distribution of texts but would address
the language classification task directly and could focus on only the differences between
languages. Popular examples of generative models are Naive Bayes, Bayesian networks,
and Hidden Markov Models. Popular examples of discriminative probabilistic models
are logistic regression and support vector machines.
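
The generative route to a classification can be sketched in a few lines. The prior and the
class-conditional likelihood values below are hypothetical numbers for a two-language task;
the function only applies Bayes' rule to them.

```python
def posterior(prior, likelihood):
    """Bayes' rule: P(Y=y | X=x) = P(x | y) P(y) / sum over y' of P(x | y') P(y')."""
    joint = {y: prior[y] * likelihood[y] for y in prior}
    evidence = sum(joint.values())          # P(X = x)
    return {y: joint[y] / evidence for y in joint}

# Hypothetical values: P(Y) and P(X = x | Y) for one observed text x.
prior = {"en": 0.5, "de": 0.5}
likelihood = {"en": 0.02, "de": 0.08}
post = posterior(prior, likelihood)
```

A discriminative model would instead output the values in `post` directly, without ever
representing `likelihood`.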


1.5. Training Generative Models 


Since our description of graphical models is based on a generative modeling approach for
an unsupervised model P(X), we briefly discuss the training of such models. In a simple
maximum likelihood model, one assumes a model P(X|w), i.e., a probabilistic model for
generating a data point given a parameter vector w. The maximum likelihood parameter
estimate is then defined by the parameters that maximize the likelihood, where the
likelihood L(w) is defined as the product of the probabilities of generating the N independent
training data points given the parameters and assumes the form


$$ L(w) = \prod_{i=1}^{N} P(X = x_i \mid w) $$
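
As an illustration, the likelihood can be evaluated for a Bernoulli model, an assumption made
here only for the example, with P(X = 1 | w) = w. The data are hypothetical coin flips, and
the grid search is a deliberately naive stand-in for a proper optimizer.

```python
import math

def likelihood(w, data):
    """L(w) = product over i of P(X = x_i | w), with P(X = 1 | w) = w."""
    return math.prod(w if x == 1 else 1.0 - w for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]           # hypothetical observations
grid = [i / 100 for i in range(1, 100)]   # candidate parameter values
w_ml = max(grid, key=lambda w: likelihood(w, data))
```

For this model the maximizer coincides with the fraction of ones in the data, which is what
the grid search recovers.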





² In the discussion on generative models, X and Y stand for random variables.

In a Bayesian approach the model is completed with an a priori distribution over models
P(M) and a prior distribution over parameters given the model, i.e., P(w|M). Based on
observed data D, one can now calculate the most likely model as the one that maximizes

$$ P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)} $$

or the parameter distribution given model and data as

$$ P(w \mid D, M) = \frac{L(w)\, P(w \mid M)}{P(D \mid M)} $$

A maximum a posteriori (MAP) estimate is obtained by taking the most likely model
and selecting the parameters that maximize P(w|D, M). A more faithfully Bayesian
approach would account for the uncertainties in the estimates by integrating over unobserved
quantities in the prediction.
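
The difference between a MAP estimate and a prediction that integrates over the parameters
can be sketched for a single fixed model. The example assumes a Bernoulli likelihood with a
Beta(a, b) prior on its parameter, a standard conjugate pair chosen here purely for
illustration; the data are hypothetical.

```python
def bernoulli_map(data, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior; needs a + k > 1 and b + n - k > 1."""
    n, k = len(data), sum(data)
    return (a + k - 1) / (a + b + n - 2)

def predictive_prob_one(data, a, b):
    """P(x_new = 1 | D), obtained by integrating over w; equals the posterior mean."""
    n, k = len(data), sum(data)
    return (a + k) / (a + b + n)

data = [1, 0, 1, 1, 0, 1, 1, 1]             # hypothetical observations
w_map = bernoulli_map(data, a=2, b=2)       # prior mildly favours w near 0.5
p_next = predictive_prob_one(data, a=2, b=2)
```

With a uniform prior (a = b = 1) the MAP estimate reduces to the maximum likelihood estimate,
the fraction of ones in the data.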


2. An Overview of Selected ML Approaches 


In this section, we provide an introduction to selected ML approaches, in particular to 
those that will be further referred to throughout the book. 


2.1. Concept Learning 


Concept learning consists of inducing the general definition of some concept (a category)
given training examples labeled as members (positive examples) and nonmembers
(negative examples) of the concept. Each training example consists of an instance x ∈ X and
its target concept value f(x). Thus, a training example can be described by the ordered
pair (x, f(x)). A learned concept is often represented as a boolean-valued function. An
example is called positive if f(x) = 1 and negative if f(x) = 0. Concept
learning can be posed as the problem of searching the space of hypotheses for the
hypothesis that best fits the training data. The concept learning task may thus be formulated
as follows [32]. Given instances x ∈ X, a target concept f to be learned (f: X → {0, 1}),
hypotheses H described by a set of constraints they impose on instances, and training
examples D (positive and negative examples of the target function), determine a hypothesis
h ∈ H such that h(x) = f(x) for all x ∈ X. Such a hypothesis, if learned on a sufficiently
large set of training examples, should also approximate the target concept well for new
examples.

By choosing the hypothesis representation, one determines the space of all
hypotheses that can ever be learned by the given method. The hypothesis space may be ordered
by a generality relation ≥_g. Let h_i and h_j be two hypotheses. Then h_i is more general than
or equal to h_j (written h_i ≥_g h_j) if and only if any instance satisfying h_j also satisfies
h_i, where an instance x ∈ X is said to satisfy h ∈ H if and only if h(x) = 1. The relation
≥_g is a partial order (i.e., it is reflexive, antisymmetric, and transitive) over the
hypothesis space. Therefore, there may also be cases where two hypotheses are incomparable
with respect to ≥_g, which happens if the sets of instances satisfied by the hypotheses are
disjoint or intersect without one being a subset of the other.
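
The generality relation can be checked directly on a finite instance space. The sketch below
represents hypotheses as Python predicates over hypothetical (sky, wind) instances; both the
representation and the attribute names are assumptions made for illustration.

```python
from itertools import product

def more_general_or_equal(h_i, h_j, instances):
    """True iff every instance that satisfies h_j also satisfies h_i."""
    return all(h_i(x) for x in instances if h_j(x))

# Hypothetical instance space: all (sky, wind) combinations.
instances = list(product(["sunny", "rainy"], ["strong", "weak"]))

def h_any(x):
    return True                    # satisfied by every instance

def h_sunny(x):
    return x[0] == "sunny"

def h_strong(x):
    return x[1] == "strong"
```

Here `h_any` is more general than `h_sunny`, while `h_sunny` and `h_strong` are incomparable:
their sets of satisfying instances intersect, but neither contains the other.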

Figure 3. An illustration of SVMs. a) a non-linear, circular concept b) linearly separable instances, margin,
and support vectors (with thicker border).


A hypothesis h is called consistent with a set of training examples D if it correctly
classifies all of these examples, that is, if and only if h(x) = f(x) for each training example
(x, f(x)) in D. The set of all hypotheses consistent with the training examples is called
the version space VS with respect to H and the training examples D. The version space
VS may be represented by the sets of its maximally specific and maximally general
hypotheses, which delimit the entire set of hypotheses consistent with the data and form the
boundaries of VS.

Concept learning algorithms utilize the structure imposed on the hypothesis space
by the relation ≥_g to search efficiently for relevant hypotheses. For example, the Find-S
algorithm [32] performs the search from most specific to most general hypotheses in order
to find the most specific hypothesis consistent with the training examples, while the
Candidate Elimination algorithm [32] exploits the general-to-specific ordering to compute the
version space by incrementally computing the sets of maximally specific and maximally
general hypotheses. The search for hypotheses is also steered by the inductive bias
of a concept learning algorithm, that is, the set of assumptions about the nature
of the target function that the algorithm uses to predict outputs for previously unseen
inputs. The learning algorithm implicitly makes assumptions about the correct output for
unseen examples in order to select one consistent hypothesis over another.
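
A minimal sketch of the specific-to-general search performed by Find-S is given below. It
assumes that hypotheses are conjunctions of attribute constraints, where "?" accepts any
value; this representation and the weather-style training data are illustrative assumptions
and are not taken from the chapter.

```python
def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positive examples.

    A hypothesis is a tuple of constraints: a concrete value, or "?" for any value.
    None denotes the most specific hypothesis, which no instance satisfies.
    """
    h = None
    for x, label in examples:
        if label != 1:
            continue                     # Find-S ignores negative examples
        if h is None:
            h = tuple(x)                 # first positive example: copy it
        else:
            # Generalize minimally: keep matching values, relax the rest to "?".
            h = tuple(hv if hv == xv else "?" for hv, xv in zip(h, x))
    return h

def satisfies(h, x):
    return h is not None and all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def consistent(h, examples):
    """h(x) = f(x) for every training example (x, f(x))."""
    return all(satisfies(h, x) == bool(label) for x, label in examples)

# Hypothetical training data: (sky, temperature, wind) with a 0/1 label.
examples = [
    (("sunny", "warm", "strong"), 1),
    (("sunny", "warm", "weak"), 1),
    (("rainy", "cold", "strong"), 0),
]
h = find_s(examples)
```

On this data the result is ("sunny", "warm", "?"), which is also consistent with the single
negative example even though Find-S never inspected it.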

Some of the concept learning algorithms proposed in the context of ILP that are relevant
for this book are FOIL [39] and PROGOL [34]. Both induce first-order rules similar
to Horn clauses. Concept learning is a very useful technique for ontology learning and
will be discussed in this context in more detail (cf. Lehmann et al. [28] in this volume).


2.2. Support Vector Machines 


Support Vector Machines (or SVMs) [2,3] may be used for binary classification or for
regression. In binary classification, they construct a hyperplane (a linear decision
boundary) that separates the instances of one class from those of the other class. The
separation between the classes is optimized by choosing the separating hyperplane that has
the largest distance, or margin, to the nearest training data points of either class.
Figure 3 illustrates the general concept of SVMs. Mathematically, the separating hyperplane
may be defined by the equation:

$$ w \cdot x + b = 0 $$

where w is a weight vector, b is a scalar, and w · x is the inner product between w and x.

The distance of the closest points to the decision boundaries defines the margin, 
which is maximized during training. The optimization problem solved by SVMs may be 
formulated as follows: 


$$ \min_{w,b}\; w \cdot w \quad \text{subject to} \quad \forall x_i \in D:\; f(x_i)\,(w \cdot x_i + b) \geq 1 \qquad (1) $$

where D is the set of training examples x_i with class labels c_i, and f(x_i) = 1 for
c_i = 1 and f(x_i) = -1 for c_i = 0.

It can be shown that the margin is 1/||w||, where ||w|| is the Euclidean norm of w. Thus
minimizing ||w|| maximizes the margin. The optimization problem is usually not solved
as posed in Equation 1, but rewritten into a dual problem, a constrained
(convex) quadratic optimization problem that can be solved by publicly available quadratic
programming solvers (for details see, for example, [9]).

A new input x can be labeled as positive (1) or negative (0) based on whether it falls
on or “above” the hyperplane:

$$ w \cdot x + b \geq 0 \ \text{ for } c_i = 1, \qquad w \cdot x + b < 0 \ \text{ for } c_i = 0 $$
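
The constraints and the decision rule can be checked numerically. In the sketch below the
weight vector w, the offset b, and the four training points are hand-picked hypothetical
values, not the output of a quadratic programming solver.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """Label 1 if x lies on or above the hyperplane w.x + b = 0, else 0."""
    return 1 if dot(w, x) + b >= 0 else 0

def constraints_hold(w, b, examples):
    """Check f(x)(w.x + b) >= 1 for all examples, with f(x) = +1 or -1."""
    return all((1 if c == 1 else -1) * (dot(w, x) + b) >= 1 for x, c in examples)

def distance_to_closest(w, b, examples):
    """Geometric distance from the hyperplane to the nearest training point."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) + b) / norm for x, _ in examples)

# Hypothetical separable data and a hand-picked separating hyperplane.
w, b = (1.0, 1.0), -3.0
examples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((1.0, 1.0), 0), ((0.0, 0.0), 0)]
```

Because two of the points meet the constraint with equality, the distance to the closest
point equals 1/||w||, which is why a smaller ||w|| corresponds to a larger margin.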





SVMs can also form nonlinear classification boundaries. For this purpose, the
original input data is transformed into a higher dimensional space by a nonlinear mapping φ,
and a linear separating hyperplane is formed in this new space. Fortunately, the nonlinear
mapping function φ does not need to be specified explicitly. While solving the quadratic
optimization problem of the linear SVM, the training tuples occur only in the form of dot
products, φ(x_i) · φ(x_j). Then, instead of computing the dot product on the transformed
tuples, it is mathematically equivalent to apply a kernel function K(x_i, x_j) to the original
input data, where the kernel is defined as:

$$ K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) $$
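
The equivalence can be verified for one concrete case. The sketch assumes two-dimensional
inputs and the homogeneous polynomial kernel of degree 2, chosen here only because its
feature map φ is small enough to write out explicitly.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2, computed in the original input space."""
    return dot(x, y) ** 2

x_i, x_j = (1.0, 2.0), (3.0, 4.0)            # hypothetical inputs
k_implicit = poly_kernel(x_i, x_j)           # uses the 2-D inputs only
k_explicit = dot(phi(x_i), phi(x_j))         # uses the 3-D transformed inputs
```

Both values agree up to floating-point rounding, although only `k_explicit` ever constructs
the higher dimensional representation.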


SVMs have attracted a lot of attention in recent years since they are less likely to suffer
from overfitting than many other methods. A drawback is that training scales unfavorably with
the size of the training data set. This section mainly follows [11,40]. A good
introduction to SVM methods may also be found in [3].


2.3. Learning in Graphical Models 


Given a set of M features or random variables X_1, ..., X_M, graphical models are a
means to efficiently describe their joint probability distribution P(X_1, ..., X_M) by
exploiting independencies in P(·). We can consider graphical models as generative models,
modeling the statistical dependencies among a potentially large number of variables. In
terms of knowledge representation, X_i might stand for the truth value of a statement,
with X_i = 1 if the statement is true and X_i = 0 otherwise.

In graphical models, independencies in the model can be displayed in the form of a
graph. We will consider the two most important subclasses of graphical models here,
i.e., Bayesian networks (a.k.a. directed graphical models) and Markov networks (a.k.a.
undirected graphical models).


2.3.1. Bayesian Networks


The Basics. Any probability distribution of M random variables can be decomposed in
product form as

$$ P(X_1, \ldots, X_M) = \prod_{i=1}^{M} P(X_i \mid \{X_j\}_{j<i}) $$

where the order of the variables is arbitrary.
Bayesian networks exploit the fact that the set of all predecessors can be reduced to
a set of parent nodes

$$ \prod_{i=1}^{M} P(X_i \mid \{X_j\}_{j<i}) = \prod_{i=1}^{M} P(X_i \mid \mathrm{par}(X_i)) $$

where par(X_i) ⊆ {X_j}_{j<i}, thus exploiting independencies in a domain. In a graphical
representation, nodes represent the random variables and one draws directed links from
parent nodes to child nodes. Although the decomposition is possible in any order, most
independencies are typically exploited when a causal ordering is observed, i.e., when the
parents of a node also correspond to its direct causal probabilistic factors. It is of great
importance to exploit domain independencies, since otherwise inference and other
operations would require resources exponential in the number of variables. Consider, as an
example, a diagnostic setting, where the random variables stand for diagnoses, symptoms,
and influencing factors. Influencing factors (e.g., smoking) are probabilistic causal
factors for diseases, and diseases are probabilistic causal factors for symptoms. After the
conditional probabilities P(X_i | par(X_i)) are defined by a medical expert, probabilistic
inference can be used to calculate the probabilities of a disease given symptoms and
given influencing factors.
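
The diagnostic setting can be sketched as a three-node chain, smoking → disease → symptom.
The structure is a simplification of the example above, and all conditional probabilities
are hypothetical values, not expert-elicited ones. Inference is done by brute-force
enumeration, which is only feasible for very small networks.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
p_s = {1: 0.3, 0: 0.7}                                        # P(S = s)
p_d_given_s = {1: {1: 0.2, 0: 0.8}, 0: {1: 0.05, 0: 0.95}}    # [s][d] = P(D = d | S = s)
p_y_given_d = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.1, 0: 0.9}}      # [d][y] = P(Y = y | D = d)

def joint(s, d, y):
    """P(S, D, Y) = P(S) P(D | S) P(Y | D): each node conditioned on its parents only."""
    return p_s[s] * p_d_given_s[s][d] * p_y_given_d[d][y]

def prob_disease(symptom, smoking):
    """P(D = 1 | Y = symptom, S = smoking), obtained by summing out D in the denominator."""
    numerator = joint(smoking, 1, symptom)
    denominator = sum(joint(smoking, d, symptom) for d in (0, 1))
    return numerator / denominator

total = sum(joint(s, d, y) for s, d, y in product((0, 1), repeat=3))
p = prob_disease(symptom=1, smoking=1)
```

Observing the symptom raises the probability of the disease for a smoker from the assumed
0.2 to roughly 0.69 in this toy network.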

One typically has the case that a certain probabilistic dependency appears several
times in the training data. For example, this could describe the dependency of fever on
flu, and this dependency is assumed to be identical for all patients. We use
P^k(X_i | par(X_i)), where i ∈ I(k), to indicate that the dependency between X_i and its
parents is of type k, and where P^k(·) now stands for a parameterized function.

Learning can assume varying complexity. In the simplest case, the causal structure
and all variables are known in training and the log-likelihood can be written as

$$ l(w) = \log L(w) = \sum_{k} \sum_{i \in I(k)} \log P^k(X_i \mid \mathrm{par}(X_i), w) $$


where w are model parameters. As we see, the log-likelihood nicely decomposes into 
sums which greatly simplifies the learning task. Typical learning approaches are max- 
imum likelihood learning, penalized maximum likelihood learning, and fully Bayesian 
learning. 
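
For fully observed discrete data, the decomposition means that each conditional probability
table can be estimated separately from counts. The sketch below does this for a single
dependency type, fever given flu, pooled over hypothetical patient records; the variable
names and data are illustrative assumptions.

```python
import math
from collections import Counter

def ml_cpt(records, child, parent):
    """Maximum likelihood estimate of P(child = 1 | parent = v) for binary variables."""
    counts = Counter((r[parent], r[child]) for r in records)
    cpt = {}
    for v in (0, 1):
        n = counts[(v, 0)] + counts[(v, 1)]
        if n:
            cpt[v] = counts[(v, 1)] / n
    return cpt

def log_likelihood(records, child, parent, cpt):
    """Sum of log P(child | parent) over all records sharing this dependency type."""
    total = 0.0
    for r in records:
        p_one = cpt[r[parent]]
        total += math.log(p_one if r[child] == 1 else 1.0 - p_one)
    return total

# Hypothetical, fully observed patient records.
records = (
    [{"flu": 1, "fever": 1}] * 3 + [{"flu": 1, "fever": 0}]
    + [{"flu": 0, "fever": 1}] + [{"flu": 0, "fever": 0}] * 3
)
cpt = ml_cpt(records, child="fever", parent="flu")
```

The estimated table attains a higher log-likelihood on these records than, for example, a
table that assigns probability 0.5 everywhere.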

In the next level of complexity, some variables are unknown in the training data and 
some form of EM (expectation maximization) learning is typically applied. 

Finally, we might also assume that the causal structure is unknown. In the most
common approach one defines a cost function (e.g., the Bayesian Information Criterion (BIC)
or the marginal likelihood) and performs a heuristic search by removing and adding directed
links to find a structure that is at least locally optimal. Alternatively, one performs
statistical tests to discover independencies in the domain and defines Bayesian network
structures consistent with those (the constraint-based approach). Care must be taken that no
directed loops are introduced into the Bayesian network during modeling or structural learning.
A large number of models used in machine learning can be considered special cases of
Bayesian networks, e.g., Hidden Markov models and Kalman filters, which is the reason for
the great importance of Bayesian networks. Readers who are interested in learning more
should consult the excellent tutorial [13].


Modeling Relationships. Traditionally, Bayesian networks have mostly been applied to
attribute-based representations. Recently, there has been increasing interest in applying
Bayesian networks to domains with object-to-object relationships, for which the term
statistical relational learning (SRL) is used. Relationships add complexity. In an
attribute-based setting, one often assumes that objects are sampled independently, which greatly
simplifies inference. For example, the wealth of a person can be predicted from the person's
income and the value of the person's home but, given this independent sampling assumption, is
independent of the wealth of other people (given the parameters). As a consequence,
inference can be performed separately for each person. In SRL, one could also consider the
wealth of this person's friends. As a consequence, random variables become globally
dependent and inference often has to be performed globally as well, in this example
potentially considering the wealth of all persons in the domain. A second issue is that directed
loops become more problematic: I cannot easily model that my wealth depends on my
friends' wealth and vice versa without introducing directed loops, which are forbidden
in Bayesian networks. Finally, aggregation plays a more important role. For example, I
might want to model that a given teacher is a good teacher if the teacher's students get
good grades in the classes the teacher teaches. This last quantity is probably not
represented in the raw data as a random variable but needs to be calculated in a preprocessing
step. As one might suspect, aggregation tends to make structural learning more complex.

Probabilistic Relational Models (PRMs) were one of the first published approaches
to SRL with Bayesian networks and attracted great interest in the statistical machine
learning community [23,10]. PRMs combine a frame-based logical representation with
probabilistic semantics based on directed graphical models. Parameter learning in PRMs is
likelihood-based or based on empirical Bayesian learning. Structural learning typically
uses a greedy search strategy, where one needs to guarantee that the ground Bayesian
network does not contain directed loops.

Another important approach is presented by the infinite hidden relational model
(IHRM) [50].³ Here each object is represented by a discrete latent variable. The parents
of a node representing a statement are the latent variables of all the objects involved.
Thus, if X_{u,m} stands for the statement that user u likes movie m, then we would obtain
the term P(X_{u,m} | X^l_u, X^l_m), where X^l_u is the latent variable for user u and X^l_m is
the latent variable for movie m. The resulting network has by construction no loops. Also,
the need for aggregation is alleviated since information can propagate in the network of
latent variables. Finally, no structural learning is required as the structure is given by the
typed relations in the domain. The IHRM is applied in the context of a Dirichlet process
mixture model where the number of states is automatically tuned in the sampling
process. In [42] it was shown how ontological class information can be integrated into the
IHRM, and in [44] it is shown how OWL constraints can be integrated.

³ Kemp et al. [20] presented an almost identical model independently.

Bayesian networks are also being used in ILP where they form the basis for combi- 
nations of rule representations with SRL. A Bayesian logic program is defined as a set 
of Bayesian clauses [21]. A Bayesian clause specifies the conditional probability distri- 
bution of a random variable given its parents on a template level, i.e. in a node-class. 
A special feature is that, for a given random variable, several such conditional proba- 
bility distributions might be given. For each clause there is one conditional probability 
distribution and for each Bayesian predicate (i.e., node-class) there is one combination 
rule. Relational Bayesian networks [17] are related to Bayesian logic programs and use 
probability formulae for specifying conditional probabilities. 


2.3.2. Markov Networks 


The Basics. The most important parameterization of the probability distribution of a
Markov network is

$$ P(X_1, \ldots, X_M) = \frac{1}{Z} \exp \sum_{k} w_k f_k(\{X\}_k) $$

where the feature functions f_k can be any real-valued functions, where {X}_k ⊆
{X_1, ..., X_M}, and where w_k ∈ ℝ. Z normalizes the distribution. In a graphical
representation, all variables in {X}_k would be mutually connected by undirected links and
thus would form cliques in the resulting graph.

We consider first that the feature functions f_k(·) are given. Learning then consists
of estimating the w_k. The log-likelihood is

$$ l(w) = -\log Z + \sum_{k} w_k f_k(\{X\}_k) $$

Even a simple maximum likelihood estimate leads to a non-trivial optimization problem,
since Z is a function of all the parameters in the model.
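
The role of Z can be made explicit for a very small model. The sketch below assumes binary
variables and computes Z by enumerating all 2^M states, which is only feasible for tiny M
and illustrates why the normalization is the expensive part. The single "agreement" feature
and its weight are hypothetical.

```python
import math
from itertools import product

def score(x, weights, features):
    """Sum over k of w_k f_k(x): the unnormalized log-probability of state x."""
    return sum(w * f(x) for w, f in zip(weights, features))

def log_partition(weights, features, n_vars):
    """log Z, by brute-force summation over all 2**n_vars binary states."""
    return math.log(sum(math.exp(score(x, weights, features))
                        for x in product((0, 1), repeat=n_vars)))

def log_likelihood(x, weights, features, n_vars):
    """l(w) = -log Z + sum over k of w_k f_k(x) for a single observed state x."""
    return score(x, weights, features) - log_partition(weights, features, n_vars)

# Hypothetical model: two variables and one feature that fires when they agree.
features = [lambda x: 1.0 if x[0] == x[1] else 0.0]
weights = [1.0]
ll = log_likelihood((0, 0), weights, features, n_vars=2)
```

Changing any weight changes Z, so the weights cannot be estimated independently of one
another.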

The more complex question is how application-specific feature functions f_k(·) can
be defined. In Markov logic networks, as described next, the feature functions are derived
from logical constraints.


Markov Logic Networks (MLN) Let us consider a simple example with two friends A 
and B. Let $X_{A,r} = 1$ and $X_{B,r} = 1$ stand for the facts that person A is rich and that 
person B is rich, respectively. Let $X_{A,p} = 1$ and $X_{B,p} = 1$ stand for the facts that person A 
is poor and that person B is poor, respectively. Then we define the feature function 
$f_{r,r}(X_{A,r}, X_{B,r})$, which is equal to one only if $X_{A,r} = 1$ and $X_{B,r} = 1$ and is equal 
to zero otherwise. Similarly, we define $f_{p,p}$, $f_{p,r}$, $f_{r,p}$. After training, we might obtain 
the weights $w_{r,r} = 10$, $w_{p,p} = 10$, $w_{r,p} = -5$, $w_{p,r} = -5$. Thus a situation where both 
friends are rich or both are poor is much more likely (for example, 
$P(X_{A,r} = 1, X_{B,r} = 1, X_{A,p} = 0, X_{B,p} = 0) = Z^{-1} \exp 10$) than a situation where only 
one of them is rich and the other one is poor (for example, 
$P(X_{A,r} = 1, X_{B,r} = 0, X_{A,p} = 0, X_{B,p} = 1) = Z^{-1} \exp(-5)$). We can consider the 
features to be derived from the logical expressions $X_{A,r} \wedge X_{B,r}$, $X_{A,p} \wedge X_{B,p}$, 
$X_{A,r} \wedge X_{B,p}$, $X_{A,p} \wedge X_{B,r}$. Obviously this knowledge base is not even 
consistent, but for MLNs this would not hurt. After learning, only the true statements 
would survive with $w \to \infty$. Even better, statements which are often but not always 
true would obtain weights which reflect this frequency. This basic idea is formalized in 
MLNs. 
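
The two-friends example can be checked numerically by enumerating all 16 joint states. The following is a minimal sketch using the weights quoted in the text; the variable ordering and the helper names `score` and `prob` are choices made here for illustration.

```python
import itertools
import math

# Weights from the example; keys are (state of A, state of B), r = rich, p = poor.
weights = {("r", "r"): 10.0, ("p", "p"): 10.0, ("r", "p"): -5.0, ("p", "r"): -5.0}


def score(x_ar, x_br, x_ap, x_bp):
    """Sum of w * f over the four conjunction features.

    Argument order (an arbitrary choice): X_{A,r}, X_{B,r}, X_{A,p}, X_{B,p}.
    """
    a = {"r": x_ar, "p": x_ap}
    b = {"r": x_br, "p": x_bp}
    # Each feature is the conjunction of one variable of A and one of B.
    return sum(w * a[s] * b[t] for (s, t), w in weights.items())


# Z sums the unnormalized probabilities of all 2^4 joint states.
states = list(itertools.product([0, 1], repeat=4))
Z = sum(math.exp(score(*s)) for s in states)


def prob(*state):
    """Normalized probability of one joint state."""
    return math.exp(score(*state)) / Z


p_both_rich = prob(1, 1, 0, 0)  # Z^{-1} exp(10)
p_mixed = prob(1, 0, 0, 1)      # Z^{-1} exp(-5)
ratio = p_both_rich / p_mixed   # exp(10) / exp(-5) = exp(15)
```

Because $Z$ cancels in the ratio, the "both rich" state is exactly $\exp(15)$ times more probable than the mixed state, whatever the other states contribute.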

Let $F_k$ be a formula of first-order logic and let $w_k \in \mathbb{R}$ be a weight attached to each 
formula. Then an MLN $L$ is defined as a set of pairs $(F_k, w_k)$ [45] [7]. One introduces a 
binary node for each possible grounding of each predicate appearing in $L$, given a set of 
constants $c_1, \ldots, c_{|C|}$. The state of the node is equal to 1 if the ground atom/statement is 
true, and 0 otherwise (for a $Q$-ary predicate there are $|C|^Q$ such nodes). A grounding 
of a formula is an assignment of constants to the variables in the formula (considering 
formulas that are universally quantified). If a formula contains $Q$ variables, then there 
are $|C|^Q$ such assignments. The nodes in the Markov network $M_{L,C}$ are the grounded 
predicates. In addition, the MLN contains one feature for each possible grounding of 
each formula $F_k$ in $L$. The value of this feature is 1 if the ground formula is true, and 
0 otherwise. $w_k$ is the weight associated with $F_k$ in $L$. A Markov network $M_{L,C}$ is a 
grounded Markov logic network of $L$ with 

$$P(\{X\} = \vec{x}) = \frac{1}{Z} \exp \Big( \sum_k w_k n_k(\vec{x}) \Big)$$

where $n_k(\vec{x})$ is the number of formula groundings that are true for $F_k$. MLN makes 
the unique names assumption, the domain closure assumption and the known function 
assumption, but all these assumptions can be relaxed. 
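
The grounding procedure can be illustrated on a very small domain. The sketch below is not from the cited MLN papers: the predicates `Rich` and `Friends`, the single formula and its weight are hypothetical, and the partition function is computed by brute force, which is feasible only because there are just six ground atoms.

```python
import itertools
import math

constants = ["A", "B"]  # the set C, so |C| = 2

# One binary node per grounding of each predicate:
# Rich/1 gives |C| atoms, Friends/2 gives |C|^2 atoms.
atoms = [("Rich", (c,)) for c in constants] + \
        [("Friends", pair) for pair in itertools.product(constants, repeat=2)]


def formula(world, x, y):
    """Hypothetical formula: Friends(x, y) AND Rich(x) IMPLIES Rich(y)."""
    body = world[("Friends", (x, y))] and world[("Rich", (x,))]
    return (not body) or world[("Rich", (y,))]


weight = 1.5  # hypothetical weight w_k of the single formula


def n_true(world):
    """n_k(x): number of true groundings; the formula has |C|^2 groundings."""
    return sum(bool(formula(world, x, y))
               for x, y in itertools.product(constants, repeat=2))


# Enumerate all 2^6 possible worlds to obtain Z by brute force.
worlds = [dict(zip(atoms, values))
          for values in itertools.product([0, 1], repeat=len(atoms))]
Z = sum(math.exp(weight * n_true(w)) for w in worlds)


def prob(world):
    """P({X} = x) = exp(w_k * n_k(x)) / Z for the single-formula MLN."""
    return math.exp(weight * n_true(world)) / Z
```

A world that violates one grounding of the formula is less probable than a world that satisfies all four by a factor of `exp(weight)`, but it is not impossible. This is the sense in which the weights soften the logical constraints.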

An MLN puts weights on formulas: the larger the weight, the higher the confidence 
that a formula is true. When all weights are equal and become infinite, one strictly 
enforces the formulas, and all worlds that agree with the formulas have the same probability. 

The simplest form of inference concerns the prediction of the truth value of a 
grounded predicate given the truth values of other grounded predicates (conjunction of 
predicates) for which the authors present an efficient algorithm. In the first phase, the 
minimal subset of the ground Markov network is returned that is required to calculate 
the conditional probability. It is essential that this subset is small since in the worst case, 
inference could involve all nodes. In the second phase, Gibbs sampling is run on this 
reduced network. 
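
The second phase can be sketched as follows. This is an illustration of Gibbs sampling on a generic binary log-linear model, not the authors' implementation: the three-node model, its weights, and the function names are hypothetical, and the first phase (extracting the minimal subnetwork) is omitted.

```python
import math
import random

# Hypothetical toy model: three binary nodes, two pairwise "agreement" features.
features = [  # (weight, indicator function of the full state tuple)
    (1.0, lambda x: x[0] == x[1]),
    (0.5, lambda x: x[1] == x[2]),
]


def score(x):
    """Weighted sum of the features that hold in state x."""
    return sum(w * float(f(x)) for w, f in features)


def conditional(x, i):
    """P(x_i = 1 | all other nodes); Z cancels in this ratio."""
    on = x[:i] + (1,) + x[i + 1:]
    off = x[:i] + (0,) + x[i + 1:]
    return 1.0 / (1.0 + math.exp(score(off) - score(on)))


def gibbs(query, evidence, n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(x_query = 1 | evidence) by resampling the unobserved nodes."""
    rng = random.Random(seed)
    x = tuple(evidence.get(i, 0) for i in range(3))
    free = [i for i in range(3) if i not in evidence]
    hits = 0
    for step in range(burn_in + n_samples):
        for i in free:
            value = 1 if rng.random() < conditional(x, i) else 0
            x = x[:i] + (value,) + x[i + 1:]
        if step >= burn_in:
            hits += x[query]
    return hits / n_samples


# Probability that node 2 is on, given that node 0 is observed to be on.
estimate = gibbs(query=2, evidence={0: 1})
```

Each conditional depends only on the features that mention the resampled node, so in a sparse network a single update touches only that node's Markov blanket.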

Learning consists of estimating the $w_k$. In learning, MLN makes a closed-world 
assumption and employs a pseudo-likelihood cost function, which is the product of the 
probabilities of each node given its Markov blanket. Optimization is performed using a 
limited-memory BFGS algorithm. 
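
The pseudo-likelihood avoids the partition function because $Z$ cancels in each per-node conditional. The sketch below computes the log of the pseudo-likelihood for a binary log-linear model. The function name, the two-node model, and the weights are hypothetical. Conditioning on all other nodes, as done here, is equivalent to conditioning on the Markov blanket. The quasi-Newton optimization over the weights is not shown.

```python
import math


def pseudo_log_likelihood(x, weights, features):
    """Sum over nodes of log P(x_i | rest) for a binary log-linear model."""
    def score(state):
        return sum(w * f(state) for w, f in zip(weights, features))

    total = 0.0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        # P(x_i | rest) = exp(s(x)) / (exp(s(x)) + exp(s(flipped))),
        # so log P(x_i | rest) = -log(1 + exp(s(flipped) - s(x))).
        total -= math.log1p(math.exp(score(flipped) - score(x)))
    return total


# Hypothetical two-node model with a single agreement feature.
features = [lambda s: float(s[0] == s[1])]
observed = (1, 1)
pll_weak = pseudo_log_likelihood(observed, [0.0], features)
pll_strong = pseudo_log_likelihood(observed, [2.0], features)
```

For the observed state `(1, 1)`, a positive weight on the agreement feature yields a higher pseudo-log-likelihood than a zero weight, which is the direction an optimizer would move in.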

Finally, there is the issue of structural learning, which, in this context, means 
determining the first-order formulae to be employed. Some formulae are typically defined 
by a domain expert a priori. Additional formulae can be learned by directly optimizing 
the pseudo-likelihood 
cost function or by using ILP algorithms. For the latter, the authors use CLAUDIEN [41], 
which can learn arbitrary first-order clauses (not just Horn clauses, as many other ILP 
approaches). 
3. Applications of ML to the Semantic Web 


Interest in applying ML techniques in the Semantic Web context has been growing 
over recent years, and special ML tracks and workshops have been formed at the main 
Semantic Web conferences⁴. 

One of the applications of ML techniques that is discussed throughout this book is 
ontology learning. ML techniques may be used both to learn ontologies from scratch 
and to enrich already existing ontologies. Learning data originates, e.g., from Linked Data, 
social networks, tags, and textual data [30,8,27]. Another popular use of ML is learning 
the mapping from one ontology to another (e.g., based on association rules [5] or 
similarity-based methods [6]). 

A number of proposed approaches for learning from the Semantic Web are based 
on ILP methods (e.g., classification [27] or association learning [19,26]). This kind of 
approach is supported by recently developed tools for ontology-based data mining such 
as DL-Learner [27]⁵, RMonto [38]⁶ or SDM-Toolkit [49]⁷. An interesting application 
of this kind was realized within the project e-LICO⁸. It consisted of optimizing 
knowledge discovery processes through ontology-based meta-learning, that is, machine 
learning from metadata of past executed experiments, where the metadata was represented 
with background ontologies [14]⁹. A perspective on ILP for the Semantic Web is described 
in [29]. 

Ontology learning and ILP assume deterministic or close-to-deterministic 
dependencies. The increase of interest in ML techniques has arisen largely due to the open, 
distributed and inherently incomplete nature of the Semantic Web. Such a context makes 
it hard to apply purely deductive techniques, which traditionally have dominated 
reasoning approaches for ontological data. As part of the LarKC project¹⁰, a scalable 
machine learning approach has been developed that works well with the high-dimensional, 
sparse, and noisy data one encounters in those domains [47,15]. The approach is based 
on matrix factorization and has shown superior performance on a number of Semantic 
Web data sets [16]. Extensions have been developed that can take into account temporal 
effects and model sequences [48] and that can include ontological background 
and textual information [18]. The approach was part of the winning entry in the ISWC 
2011 Semantic Web Challenge¹¹. Tensor factorization is another promising direction, 
since the subject-predicate-object structure of the Semantic Web maps directly onto the 
modes of a three-way tensor [35]. Another lightweight approach is presented in [22], 
where the authors describe SPARQL-ML, a framework for adding data mining support 
to SPARQL. The approach uses relational Bayesian classifiers (RBC) and relational 
probability trees (RPT). In turn, [25] proposes to semantically group SPARQL query results 
via conceptual clustering. 
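
The correspondence between triples and a three-way tensor, mentioned above, can be made concrete with a short sketch. The triples, the entity and predicate names, and the mode ordering (subject, object, predicate) are all assumptions made here for illustration. The factorization itself is not shown.

```python
# Hypothetical triples; entity and predicate names are made up for illustration.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
]

# Index every entity (subject or object) and every predicate.
entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
predicates = sorted({p for _, p, _ in triples})
e_idx = {e: i for i, e in enumerate(entities)}
p_idx = {p: k for k, p in enumerate(predicates)}

# Sparse three-way tensor: entry (subject, object, predicate) is 1 for a known triple.
tensor = {(e_idx[s], e_idx[o], p_idx[p]): 1 for s, p, o in triples}
shape = (len(entities), len(entities), len(predicates))


def entry(s, p, o):
    """Return 1 if the triple is asserted, else 0 (unknown, not necessarily false)."""
    return tensor.get((e_idx[s], e_idx[o], p_idx[p]), 0)
```

Only the asserted triples are stored. The zero entries stand for triples that are unknown rather than known to be false, which is where a learned factorization can propose likely missing statements.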

An overview of early work on the application of ML to the Semantic Web can 
be found in [46], with applications described in [37]. Data mining perspectives for the 
Semantic Web have been described by [1,31]. More recent overviews are presented in 
[4] and in [43]. 

⁴ Inductive Reasoning and Machine Learning for the Semantic Web (IRMLeS) workshops, http://irmles.di.uniba.it 

⁵ http://aksw.org/Projects/DLLearner 

⁶ http://semantic.cs.put.poznan.pl/RMonto 

⁷ http://sourceforge.net/p/sdmtoolkit/ 

⁸ http://www.e-lico.eu 

⁹ http://www.dmo-foundry.org 

¹⁰ http://www.larkc.eu 

¹¹ http://challenge.semanticweb.org 


4. Conclusions 


In this chapter we have discussed machine learning as the basis for the remaining 
chapters in this book. With an increasing amount of data published in the format of the 
Semantic Web, we expect that the number of machine learning applications will grow. 
Due to space limitations we could only cover the few aspects we felt are most relevant. By 
now machine learning is a large research area with a multitude of theories, approaches 
and algorithms. We feel that there will not be one dominating approach towards machine 
learning on the Semantic Web but that we can expect creative solutions from different 
machine learning research areas. 